To date, little attention has been given to multi-view 3D human mesh estimation, despite its real-world applicability (e.g., motion capture, sport analysis) and robustness to single-view ambiguities. Existing solutions typically suffer from poor generalization to new settings, largely due to the limited diversity of image-mesh pairs in multi-view training data. To address this shortcoming, prior work has explored the use of synthetic images. However, beyond the usual visual gap between rendered and target data, synthetic-data-driven multi-view estimators also overfit to the camera viewpoint distribution sampled during training, which usually differs from real-world distributions. Tackling both challenges, we propose a novel simulation-based training pipeline for multi-view human mesh recovery, which (a) relies on intermediate 2D representations that are more robust to the synthetic-to-real domain gap; (b) leverages learnable calibration and triangulation to adapt to more diversified camera setups; and (c) progressively aggregates multi-view information in a canonical 3D space to remove ambiguities in the 2D representations. Through extensive benchmarking, we demonstrate the superiority of the proposed solution, especially for unseen in-the-wild scenarios.
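As a concrete reference point for the triangulation step, here is a minimal classical DLT (direct linear transform) triangulator in NumPy. This is the textbook building block that a learnable triangulation module typically differentiates through, not the paper's actual pipeline; the two-camera setup below is a made-up example.

```python
import numpy as np

def triangulate_dlt(proj_mats, points_2d):
    """Linear (DLT) triangulation of one 3D point from N calibrated views.

    proj_mats: list of (3, 4) camera projection matrices.
    points_2d: (N, 2) pixel observations of the same 3D point.
    Returns the 3D point in world coordinates.
    """
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector of A with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # dehomogenize

# Two synthetic cameras observing a known point.
X_true = np.array([0.2, -0.1, 3.0])
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])               # camera at the origin
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])  # translated camera
obs = []
for P in (P1, P2):
    x = P @ np.append(X_true, 1.0)
    obs.append(x[:2] / x[2])
X_hat = triangulate_dlt([P1, P2], np.array(obs))
```

With noise-free observations the recovered point matches `X_true` to machine precision; a learnable variant would instead weight each view's contribution before solving.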
This paper presents a Generative RegIon-to-Text transformer, GRiT, for object understanding. The spirit of GRiT is to formulate object understanding as <region, text> pairs, where region locates objects and text describes objects. For example, the text in object detection denotes class names while that in dense captioning refers to descriptive sentences. Specifically, GRiT consists of a visual encoder to extract image features, a foreground object extractor to localize objects, and a text decoder to generate open-set object descriptions. With the same model architecture, GRiT can understand objects via not only simple nouns, but also rich descriptive sentences including object attributes or actions. Experimentally, we apply GRiT to object detection and dense captioning tasks. GRiT achieves 60.4 AP on COCO 2017 test-dev for object detection and 15.5 mAP on Visual Genome for dense captioning. Code is available at https://github.com/JialianW/GRiT
Understanding 3D motion in dynamic scenes is essential for many vision applications. Recent advances mainly focus on estimating the activity of certain specific elements such as humans. In this paper, we leverage a neural motion field to estimate the motion of all points in a multi-view setting. Modeling the motion of a dynamic scene from multi-view data is challenging due to the ambiguities introduced by points with similar colors and points with time-varying colors. We propose to regularize the estimated motion to be predictable: if the motion from previous frames is known, the motion in the near future should be predictable. We therefore introduce predictability regularization by first conditioning the estimated motion on latent embeddings, then adopting a predictor network that enforces predictability on the embeddings. The proposed framework, PREF (Predictability REgularized Fields), achieves on-par or better results than state-of-the-art neural-motion-field-based dynamic scene representation methods, while requiring no prior knowledge of the scene.
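The predictability regularizer can be illustrated with a toy version. Here a fixed linear predictor matrix stands in for the paper's predictor network (whose architecture the abstract does not specify): the penalty is small only when each frame's motion latent can be forecast from the previous frame's latent.

```python
import numpy as np

def predictability_loss(embeddings, predictor):
    """Penalty that is near zero only when each frame's motion latent can be
    forecast from the previous frame's latent via the predictor matrix."""
    prev, nxt = embeddings[:-1], embeddings[1:]
    return np.mean((prev @ predictor.T - nxt) ** 2)

# Latents that follow exact linear dynamics are perfectly predictable.
W = np.array([[0.9, 0.1],
              [0.0, 1.0]])
z = np.array([1.0, -0.5])
latents = [z]
for _ in range(4):
    latents.append(W @ latents[-1])
latents = np.stack(latents)

zero_loss = predictability_loss(latents, W)        # correct predictor
bad_loss = predictability_loss(latents, np.eye(2))  # naive "no motion" predictor
```

In the actual method the predictor is trained jointly with the motion field, so the regularizer shapes the latent space rather than scoring a fixed one.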
Despite the recent success of self-supervised contrastive learning models based on 3D point cloud representations, the adversarial robustness of such pre-trained models has raised concerns. Adversarial contrastive learning (ACL) is considered an effective way to improve the robustness of pre-trained models. In contrastive learning, the projector is considered an effective component for removing unnecessary feature information during pre-training, and most ACL works use the contrastive loss on projected feature representations to generate adversarial examples during pre-training, while "unprojected" feature representations are used to generate adversarial inputs at inference time. Due to the distribution gap between projected and "unprojected" features, these models are limited in their ability to obtain reliable feature representations for downstream tasks. We introduce a new method that uses "unprojected" feature representations within the contrastive learning framework, leveraging a virtual adversarial loss to generate high-quality 3D adversarial examples for adversarial training. We present robustness-aware loss functions for the adversarial self-supervised contrastive learning framework. Furthermore, we find that selecting high-difference points with the Difference of Normals (DoN) operator as additional input for adversarial self-supervised contrastive learning can significantly improve the adversarial robustness of the pre-trained model. We validate our method on downstream tasks, including 3D classification and 3D segmentation, using multiple datasets. It achieves comparable robust accuracy to state-of-the-art adversarial learning methods.
We propose a method for estimating the 6DoF pose of a rigid object with an available 3D model from a single RGB image. Unlike classical correspondence-based methods, which predict 3D object coordinates at pixels of the input image, the proposed method predicts 3D object coordinates at 3D query points sampled in the camera frustum. This move from pixels to 3D points, inspired by recent PIFu-style methods for 3D reconstruction, enables reasoning about the whole object, including its (self-)occluded parts. For a 3D query point associated with pixel-aligned image features, we train a fully connected neural network to predict: (i) the corresponding 3D object coordinates, and (ii) the signed distance to the object surface, with the first defined only for query points in the surface vicinity. We call the mapping realized by this network a Neural Correspondence Field. The object pose is then robustly estimated from the predicted 3D-3D correspondences by the Kabsch-RANSAC algorithm. The proposed method achieves state-of-the-art results on three BOP datasets and is especially superior in challenging cases with occlusion. The project website is at: linhuang17.github.io/ncf.
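The pose-fitting step above relies on the classical Kabsch algorithm, which can be sketched as follows. This is the textbook closed-form alignment on noise-free correspondences, without the RANSAC loop the method wraps around it; the synthetic rotation and translation below are arbitrary.

```python
import numpy as np

def kabsch(src, dst):
    """Closed-form least-squares rigid transform (R, t) with R @ src_i + t ≈ dst_i."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)       # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    return R, t

# Synthetic 3D-3D correspondences: rotate about z by 0.7 rad, then translate.
th = 0.7
R_true = np.array([[np.cos(th), -np.sin(th), 0.0],
                   [np.sin(th),  np.cos(th), 0.0],
                   [0.0,         0.0,        1.0]])
t_true = np.array([1.0, 2.0, 3.0])
src = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.],
                [0., 0., 1.], [1., 2., 3.]])
dst = src @ R_true.T + t_true
R_est, t_est = kabsch(src, dst)
```

In the full pipeline, RANSAC repeatedly runs this solver on minimal sets of predicted correspondences and keeps the hypothesis with the most inliers.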
Transformer trackers have recently achieved impressive advancements, in which the attention mechanism plays an important role. However, the independent correlation computation in the attention mechanism can result in noisy and ambiguous attention weights, which inhibits further performance improvement. To address this issue, we propose an attention in attention (AiA) module, which enhances appropriate correlations and suppresses erroneous ones by seeking consensus among all correlation vectors. Our AiA module can be readily applied to both self-attention blocks and cross-attention blocks to facilitate feature aggregation and information propagation for visual tracking. Moreover, we propose a streamlined Transformer tracking framework, dubbed AiATrack, which introduces efficient feature reuse and target-background embeddings to make full use of temporal references. Experiments show that our tracker achieves state-of-the-art performance on six tracking benchmarks while running at real-time speed.
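As a rough illustration only (the abstract does not spell out the exact AiA design), one way to realize "consensus among correlation vectors" is to run a second, inner attention over the raw correlation map itself, so that each correlation vector consults all others before the final softmax:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_in_attention(Q, K):
    """Refine raw query-key correlations with an inner attention in which
    correlation vectors attend to each other, damping spurious matches.
    This is a loose sketch, not the published AiA module."""
    corr = Q @ K.T / np.sqrt(Q.shape[1])                   # (Nq, Nk) raw correlations
    inner = softmax(corr @ corr.T / np.sqrt(K.shape[0]))   # (Nq, Nq) consensus weights
    refined = corr + inner @ corr                          # residual consensus term
    return softmax(refined)                                # final attention weights

rng = np.random.default_rng(1)
A = attention_in_attention(rng.normal(size=(4, 8)), rng.normal(size=(6, 8)))
```

The residual form keeps the original correlations as a baseline and only adjusts them toward the consensus of the other vectors.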
The multi-task learning (MTL) paradigm focuses on jointly learning two or more tasks, aiming for significant improvement w.r.t. a model's generalizability, performance, and training/inference memory footprint. These benefits become all the more indispensable when jointly training vision-related dense prediction tasks. In this work, we tackle the MTL problem of two dense tasks, i.e., semantic segmentation and depth estimation, and present a novel attention module called the Cross-Channel Attention Module (CCAM), which facilitates effective feature sharing along each channel between the two tasks, leading to mutual performance gains with a negligible increase in trainable parameters. Then, in a true symbiotic spirit, we formulate a novel data augmentation for the semantic segmentation task using predicted depth, called AffineMix, and a simple depth augmentation using predicted semantics, called ColorAug. Finally, we validate the performance gain of the proposed method on the Cityscapes dataset, which helps us achieve state-of-the-art results for a semi-supervised joint model based on depth and semantic segmentation.
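The abstract does not detail AffineMix's formulation, so the following is only a hypothetical illustration of the general idea of depth-guided mixing: composite two views per pixel according to predicted depth ordering, and reuse the same mask to mix the label maps.

```python
import numpy as np

def depth_ordered_mix(img_a, depth_a, img_b, depth_b):
    """Keep, at each pixel, the image whose predicted depth is nearer;
    segmentation labels would be mixed with the identical mask."""
    front_a = depth_a < depth_b                      # True where A occludes B
    mixed = np.where(front_a[..., None], img_a, img_b)
    return mixed, front_a

# Tiny 2x2 example: A is nearer in the left column, B in the right column.
img_a = np.full((2, 2, 3), 10.0)
img_b = np.full((2, 2, 3), 200.0)
depth_a = np.array([[1.0, 5.0], [1.0, 5.0]])
depth_b = np.array([[3.0, 3.0], [3.0, 3.0]])
mixed, mask = depth_ordered_mix(img_a, depth_a, img_b, depth_b)
```

Using predicted (rather than ground-truth) depth is what makes the augmentation available in a semi-supervised setting.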
In this paper, we present an efficient and robust deep learning solution for novel view synthesis of complex scenes. In our approach, a 3D scene is represented as a light field, i.e., a set of rays, each of which has a corresponding color when reaching the image plane. For efficient novel view rendering, we adopt a two-plane parameterization of the light field, where each ray is characterized by a 4D parameter. We then formulate the light field as a 4D function that maps 4D coordinates to corresponding color values. We train a deep fully connected network to optimize this implicit function and memorize the 3D scene, and the scene-specific model is then used to synthesize novel views. Unlike previous methods that require dense view sampling to reliably render novel views, our method renders a novel view by sampling its rays and querying the color of each ray directly from the network, thus enabling high-quality light field rendering from a sparser set of training images. The network can optionally predict per-ray depth, enabling applications such as auto-refocus. Our novel view synthesis results are comparable to the state of the art, and even superior in some challenging scenes with refraction and reflection. We achieve this while maintaining an interactive frame rate and a small memory footprint.
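A minimal sketch of the two-plane parameterization and an implicit 4D-to-RGB network follows. The plane positions, layer sizes, and untrained random weights are placeholders for illustration, not the paper's architecture.

```python
import numpy as np

def ray_to_uvst(origin, direction, z_uv=0.0, z_st=1.0):
    """Two-plane light-field parameterization: intersect the ray with the
    planes z = z_uv and z = z_st; the two (x, y) hits form the 4D coordinate."""
    t1 = (z_uv - origin[2]) / direction[2]
    t2 = (z_st - origin[2]) / direction[2]
    u, v = (origin + t1 * direction)[:2]
    s, t = (origin + t2 * direction)[:2]
    return np.array([u, v, s, t])

def light_field_mlp(uvst, params):
    """Tiny fully connected network mapping a 4D ray coordinate to RGB in (0, 1)."""
    W1, b1, W2, b2 = params
    h = np.maximum(0.0, W1 @ uvst + b1)           # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))   # sigmoid keeps colors in range

rng = np.random.default_rng(0)
params = (rng.normal(size=(64, 4)), rng.normal(size=64),
          rng.normal(size=(3, 64)), rng.normal(size=3))
coord = ray_to_uvst(np.array([0.5, 0.5, -1.0]), np.array([0.0, 0.0, 1.0]))
rgb = light_field_mlp(coord, params)
```

Rendering a novel view then amounts to generating its rays, converting each to (u, v, s, t), and querying the network, with no geometry reconstruction involved.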
Graph Neural Networks (GNNs) have shown satisfying performance on various graph learning tasks. To achieve better fitting capability, most GNNs carry a large number of parameters, which makes them computationally expensive. It is therefore difficult to deploy them onto edge devices with scarce computational resources, e.g., mobile phones and wearable smart devices. Knowledge Distillation (KD) is a common solution to compress GNNs, where a lightweight model (i.e., the student model) is encouraged to mimic the behavior of a computationally expensive GNN (i.e., the teacher GNN model). Nevertheless, most existing GNN-based KD methods lack fairness consideration. As a consequence, the student model usually inherits and even exaggerates the bias of the teacher GNN. To handle such a problem, we take initial steps towards fair knowledge distillation for GNNs. Specifically, we first formulate a novel problem of fair knowledge distillation for GNN-based teacher-student frameworks. Then we propose a principled framework named RELIANT to mitigate the bias exhibited by the student model. Notably, the design of RELIANT is decoupled from any specific teacher and student model structures, and thus can be easily adapted to various GNN-based KD frameworks. We perform extensive experiments on multiple real-world datasets, which corroborate that RELIANT achieves less biased GNN knowledge distillation while maintaining high prediction utility.
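RELIANT's fairness machinery is beyond the scope of the abstract, but the soft-label distillation objective such teacher-student frameworks build on can be sketched as follows. The temperature T and the T^2 scaling follow the standard Hinton-style formulation, not anything specific to RELIANT.

```python
import numpy as np

def softened_softmax(logits, T):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened outputs, scaled by T^2
    so gradient magnitudes stay comparable to the hard-label loss."""
    p = softened_softmax(teacher_logits, T)
    q = softened_softmax(student_logits, T)
    return T * T * np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()

teacher = np.array([[2.0, 0.5, -1.0]])
matched = distillation_loss(teacher, teacher)               # student matches teacher
off = distillation_loss(np.array([[0.0, 0.0, 0.0]]), teacher)  # uniform student
```

A fairness-aware variant would add a debiasing term on top of this objective so the student does not simply reproduce the teacher's biased soft labels.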
To generate high-quality rendered images for real-time applications, it is common to trace only a few samples per pixel (spp) at a lower resolution and then supersample to the high resolution. Based on the observation that the pixels rendered at a low resolution are typically highly aliased, we present a novel method for neural supersampling based on ray tracing 1/4-spp samples at the high resolution. Our key insight is that the ray-traced samples at the target resolution are accurate and reliable, which turns supersampling into an interpolation problem. We present a mask-reinforced neural network to reconstruct and interpolate high-quality image sequences. First, a novel temporal accumulation network is introduced to compute the correlation between current and previous features, significantly improving their temporal stability. Then a reconstruction network based on a multi-scale U-Net with skip connections is adopted to reconstruct and generate the desired high-resolution image. Experimental results and comparisons show that, without increasing the total number of ray-traced samples, our method generates higher-quality supersampling results than current state-of-the-art methods.
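For context, the classical (non-learned) baseline that a temporal accumulation network improves upon is an exponentially weighted blend of the current frame with reprojected history, disabled wherever reprojection is invalid (e.g., disocclusions). The blend factor below is an arbitrary placeholder, not a value from the paper.

```python
import numpy as np

def temporal_accumulate(current, history, valid, alpha=0.2):
    """Blend the current frame with reprojected history per pixel; fall back
    to the current frame wherever the history sample is invalid."""
    blended = alpha * current + (1.0 - alpha) * history
    return np.where(valid, blended, current)

current = np.ones((2, 2))                       # new, noisy frame
history = np.zeros((2, 2))                      # reprojected accumulated history
valid = np.array([[True, True], [True, False]])  # bottom-right pixel disoccluded
out = temporal_accumulate(current, history, valid, alpha=0.25)
```

The learned network in the paper replaces the fixed alpha and validity heuristic with weights computed from the correlation between current and previous features.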